An Association-Based Method for Automatic Indexing with a Controlled Vocabulary

Authors

  • Christian Plaunt
  • Barbara A. Norgard
Abstract

In this paper we describe and test a two-stage algorithm, based on a lexical collocation technique, that maps from the lexical clues contained in a document representation into a controlled vocabulary list of subject headings. In the first stage, using a collection of 4,626 INSPEC documents, we create a “dictionary” of associations between the lexical items contained in the titles, authors and abstracts and the controlled vocabulary subject headings assigned to those records by human indexers, with a likelihood ratio statistic as the measure of association. In the deployment stage, we use the dictionary to predict which of the controlled vocabulary subject headings best describe new documents when they are presented to the system. Our evaluation of this algorithm, in which we compare the automatically assigned subject headings to the subject headings assigned to the test documents by human catalogers, shows that we can obtain results comparable to and consistent with human cataloging. In effect, we have cast this as a classic partial match information retrieval problem: we consider the problem to be one of “retrieving” (or assigning) the most probably “relevant” (or correct) controlled vocabulary subject headings to a document based on the clues contained in that document.
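The two stages described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the exact statistic or scoring rule, so the sketch assumes a log-likelihood ratio (G²) over a 2×2 term/heading contingency table and assumes that a new document is scored by summing the association weights of its terms. The function names `llr`, `build_dictionary` and `assign_headings` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def _xlogx(x):
    # x * ln(x), with the usual convention 0 * ln(0) = 0
    return x * math.log(x) if x > 0 else 0.0


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: records with both the term and the heading
    k12: records with the term but not the heading
    k21: records with the heading but not the term
    k22: records with neither
    """
    n = k11 + k12 + k21 + k22
    cells = _xlogx(k11) + _xlogx(k12) + _xlogx(k21) + _xlogx(k22)
    rows = _xlogx(k11 + k12) + _xlogx(k21 + k22)
    cols = _xlogx(k11 + k21) + _xlogx(k12 + k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (cells + _xlogx(n) - rows - cols))


def build_dictionary(records):
    """Stage 1: build term -> {heading: association weight}.

    `records` is an iterable of (terms, headings) pairs taken from
    human-indexed training documents.
    """
    records = [(set(t), set(h)) for t, h in records]
    n = len(records)
    term_df, head_df, pair_df = Counter(), Counter(), Counter()
    for terms, heads in records:
        term_df.update(terms)
        head_df.update(heads)
        pair_df.update((t, h) for t in terms for h in heads)

    assoc = defaultdict(dict)
    for (t, h), k11 in pair_df.items():
        k12 = term_df[t] - k11
        k21 = head_df[h] - k11
        k22 = n - k11 - k12 - k21
        # Assumption: keep only pairs that co-occur more often than chance.
        if k11 * n > term_df[t] * head_df[h]:
            assoc[t][h] = llr(k11, k12, k21, k22)
    return assoc


def assign_headings(assoc, terms, top_k=5):
    """Stage 2: rank headings for a new document.

    Assumption: a heading's score is the sum of its association
    weights over the distinct terms of the document.
    """
    scores = Counter()
    for t in set(terms):
        for h, w in assoc.get(t, {}).items():
            scores[h] += w
    return [h for h, _ in scores.most_common(top_k)]
```

Given a training list of `(terms, headings)` pairs, `build_dictionary` would be run once, and `assign_headings(assoc, new_doc_terms)` would then return a ranked list of candidate headings, which is the "partial match retrieval" framing the abstract describes. Tokenization, stop-word handling and the choice of `top_k` are left open here.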

Similar resources

An Approach to Automatic Indexing of Scientific Publications in High Energy Physics for Database SPIRES HEP

We introduce an approach to automatic indexing of e-prints based on a pattern-matching technique making extensive use of an Associative Patterns Dictionary (APD), developed by us. Entries in the APD consist of natural language phrases with the same semantic interpretation as a set of keywords from a controlled vocabulary. The method also makes it possible to recognize, within e-prints, formulae written in TE...

Bibliographic database access using free-text and controlled vocabulary: an evaluation

This paper evaluates and compares the retrieval effectiveness of various search models, based on either automatic text-word indexing or on manually assigned controlled descriptors. Retrieval is from a relatively large collection of bibliographic material written in French. Moreover, for this French collection we evaluate improvements that result from combining automatic and manual indexing. Fir...

Automatic Indexing for Research Papers Using References

An effective way to reveal the contents of research papers is to assign a group of terms from a controlled vocabulary. To the best of our knowledge, a variety of automatic indexing techniques have been studied to enhance effectiveness and efficiency. However, the current approaches depend on the content of a research paper, such as title, abstract, etc., which suffer from limita...

LOHAI: Providing a Baseline for KOS based Automatic Indexing

Automatic KOS based indexing – i.e. indexing based on a restricted, controlled vocabulary, a thesaurus or a classification – can play an important role to close the gap between the intellectually, high quality indexed publications and the mass of unindexed publications. Especially for unknown, heterogeneous publications, like web publications, simple processes that do not rely on manually creat...

Comparing a rule-based versus statistical system for automatic categorization of MEDLINE documents according to biomedical specialty

Automatic document categorization is an important research problem in Information Science and Natural Language Processing. Many applications, including Word Sense Disambiguation and Information Retrieval in large collections, can benefit from such categorization. This paper focuses on automatic categorization of documents from the biomedical literature into broad discipline-based categories. Tw...

Semantically Enhanced Automatic Keyphrase Indexing

The goal of this PhD thesis is to elaborate methods for automatic keyphrase indexing with a controlled vocabulary. Keyphrases are single words or multi-word lexemes that concisely and accurately describe the subject or an aspect of the subject discussed in a document. They are widely used in large document collections such as digital libraries and document repositories. They help organize mater...


Journal title:
  • JASIS

Volume 49

Publication date 1998